Similarity Model and Term Association for Document Categorization
Abstract
This paper addresses similarity models and term association for similarity-based document categorization. Similarity models based on Euclidean distance and on the cosine measure are widely used to measure document similarity in the information retrieval and document categorization communities. Both models assume that term vectors are orthogonal, so associations between terms are ignored. In practice this assumption does not hold. In the context of document categorization, we analyze the properties of the term-document space, the term-category space and the category-document space. Then, without assuming term independence, we propose a new mathematical model to estimate the association between terms. Unlike other models of term relationships, ours makes as much use as possible of the existing category membership represented in the corpus, with the objective of improving categorization performance. By introducing associations between terms, we take them into account when calculating document similarity and define a -similarity model of documents. Experiments were carried out with a kNN classifier on the Reuters-21578 corpus. The empirical results show that using term association can improve the effectiveness of a categorization system and that the -similarity model outperforms models that do not consider term association.
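The abstract does not give the proposed formula, so the following is only a minimal sketch of the general idea, not the paper's model: the plain cosine measure is compared with a generalized cosine in which a term-association matrix weights cross-term products. The function names, the three-word vocabulary and the association values in `ASSOC` are all hypothetical and chosen for illustration.

```python
import math

def cosine(x, y):
    """Plain cosine similarity; implicitly treats term axes as orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def bilinear(x, y, assoc):
    """Compute x^T A y, where assoc[i][j] is the association of terms i and j."""
    return sum(x[i] * assoc[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def association_cosine(x, y, assoc):
    """Cosine generalized with a term-association matrix (illustrative only).

    With the identity matrix this reduces to the plain cosine measure.
    """
    norm = math.sqrt(bilinear(x, x, assoc)) * math.sqrt(bilinear(y, y, assoc))
    return bilinear(x, y, assoc) / norm if norm else 0.0

# Hypothetical vocabulary: ["car", "automobile", "banana"]
doc_a = [1.0, 0.0, 0.0]   # mentions only "car"
doc_b = [0.0, 1.0, 0.0]   # mentions only "automobile"

# Assumed association values: "car" and "automobile" are strongly related.
ASSOC = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
IDENTITY = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(cosine(doc_a, doc_b))                      # no shared terms
print(association_cosine(doc_a, doc_b, ASSOC))   # related terms contribute
```

Under the orthogonality assumption the two documents share no terms and look unrelated, while the association-aware variant credits the relationship between their terms. How the association values are estimated from category membership is the paper's contribution and is not reproduced here.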
Similar resources
Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization
The project pursued in this paper is to develop from first information-geometric principles a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarit...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Concept Mining: A Conceptual Understanding based Approach
Due to the rapid daily growth of information, there is a considerable need to extract and discover valuable knowledge from data sources such as the World Wide Web. Most of the common techniques in text mining are based on the statistical analysis of a term, either a word or a phrase. These techniques consider documents as bags of words and pay no attention to the meanings of the document content...
Text Mining
“Bag of words” model, acronym extraction, authorship ascription, coordinate matching, data mining, document clustering, document frequency, document retrieval, document similarity metrics, entity extraction, hidden Markov models, hubs and authorities, information extraction, information retrieval, key-phrase assignment, key-phrase extraction, knowledge engineering, language identification, link...
A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure
Author profiling is a text classification technique used to predict the profiles of the authors of unknown texts by analyzing their writing styles. Author profiles are characteristics of the authors such as gender, age, native language, country and educational background. Existing approaches to author profiling suffer from problems such as high feature dimensionality and fail to capture th...
Publication date: 2002